98 research outputs found

    Personalized PageRank with Node-dependent Restart

    Get PDF
    Personalized PageRank is an algorithm to classify the improtance of web pages on a user-dependent basis. We introduce two generalizations of Personalized PageRank with node-dependent restart. The first generalization is based on the proportion of visits to nodes before the restart, whereas the second generalization is based on the probability of visited node just before the restart. In the original case of constant restart probability, the two measures coincide. We discuss interesting particular cases of restart probabilities and restart distributions. We show that the both generalizations of Personalized PageRank have an elegant expression connecting the so-called direct and reverse Personalized PageRanks that yield a symmetry property of these Personalized PageRanks

    SciRecSys: A Recommendation System for Scientific Publication by Discovering Keyword Relationships

    Full text link
    In this work, we propose a new approach for discovering various relationships among keywords over the scientific publications based on a Markov Chain model. It is an important problem since keywords are the basic elements for representing abstract objects such as documents, user profiles, topics and many things else. Our model is very effective since it combines four important factors in scientific publications: content, publicity, impact and randomness. Particularly, a recommendation system (called SciRecSys) has been presented to support users to efficiently find out relevant articles

    Large Scale Spectral Clustering Using Approximate Commute Time Embedding

    Full text link
    Spectral clustering is a novel clustering method which can detect complex shapes of data clusters. However, it requires the eigen decomposition of the graph Laplacian matrix, which is proportion to O(n3)O(n^3) and thus is not suitable for large scale systems. Recently, many methods have been proposed to accelerate the computational time of spectral clustering. These approximate methods usually involve sampling techniques by which a lot information of the original data may be lost. In this work, we propose a fast and accurate spectral clustering approach using an approximate commute time embedding, which is similar to the spectral embedding. The method does not require using any sampling technique and computing any eigenvector at all. Instead it uses random projection and a linear time solver to find the approximate embedding. The experiments in several synthetic and real datasets show that the proposed approach has better clustering quality and is faster than the state-of-the-art approximate spectral clustering methods

    Do logarithmic proximity measures outperform plain ones in graph clustering?

    Full text link
    We consider a number of graph kernels and proximity measures including commute time kernel, regularized Laplacian kernel, heat kernel, exponential diffusion kernel (also called "communicability"), etc., and the corresponding distances as applied to clustering nodes in random graphs and several well-known datasets. The model of generating random graphs involves edge probabilities for the pairs of nodes that belong to the same class or different predefined classes of nodes. It turns out that in most cases, logarithmic measures (i.e., measures resulting after taking logarithm of the proximities) perform better while distinguishing underlying classes than the "plain" measures. A comparison in terms of reject curves of inter-class and intra-class distances confirms this conclusion. A similar conclusion can be made for several well-known datasets. A possible origin of this effect is that most kernels have a multiplicative nature, while the nature of distances used in cluster algorithms is an additive one (cf. the triangle inequality). The logarithmic transformation is a tool to transform the first nature to the second one. Moreover, some distances corresponding to the logarithmic measures possess a meaningful cutpoint additivity property. In our experiments, the leader is usually the logarithmic Communicability measure. However, we indicate some more complicated cases in which other measures, typically, Communicability and plain Walk, can be the winners.Comment: 11 pages, 5 tables, 9 figures. Accepted for publication in the Proceedings of 6th International Conference on Network Analysis, May 26-28, 2016, Nizhny Novgorod, Russi

    Euclidean Distances, soft and spectral Clustering on Weighted Graphs

    Get PDF
    We define a class of Euclidean distances on weighted graphs, enabling to perform thermodynamic soft graph clustering. The class can be constructed form the "raw coordinates" encountered in spectral clustering, and can be extended by means of higher-dimensional embeddings (Schoenberg transformations). Geographical flow data, properly conditioned, illustrate the procedure as well as visualization aspects.Comment: accepted for presentation (and further publication) at the ECML PKDD 2010 conferenc

    Information flow in interaction networks II: channels, path lengths and potentials

    Full text link
    In our previous publication, a framework for information flow in interaction networks based on random walks with damping was formulated with two fundamental modes: emitting and absorbing. While many other network analysis methods based on random walks or equivalent notions have been developed before and after our earlier work, one can show that they can all be mapped to one of the two modes. In addition to these two fundamental modes, a major strength of our earlier formalism was its accommodation of context-specific directed information flow that yielded plausible and meaningful biological interpretation of protein functions and pathways. However, the directed flow from origins to destinations was induced via a potential function that was heuristic. Here, with a theoretically sound approach called the channel mode, we extend our earlier work for directed information flow. This is achieved by constructing a potential function facilitating a purely probabilistic interpretation of the channel mode. For each network node, the channel mode combines the solutions of emitting and absorbing modes in the same context, producing what we call a channel tensor. The entries of the channel tensor at each node can be interpreted as the amount of flow passing through that node from an origin to a destination. Similarly to our earlier model, the channel mode encompasses damping as a free parameter that controls the locality of information flow. Through examples involving the yeast pheromone response pathway, we illustrate the versatility and stability of our new framework.Comment: Minor changes from v3. 30 pages, 7 figures. Plain LaTeX format. This version contains some additional material compared to the journal submission: two figures, one appendix and a few paragraph

    Semantic distillation: a method for clustering objects by their contextual specificity

    Full text link
    Techniques for data-mining, latent semantic analysis, contextual search of databases, etc. have long ago been developed by computer scientists working on information retrieval (IR). Experimental scientists, from all disciplines, having to analyse large collections of raw experimental data (astronomical, physical, biological, etc.) have developed powerful methods for their statistical analysis and for clustering, categorising, and classifying objects. Finally, physicists have developed a theory of quantum measurement, unifying the logical, algebraic, and probabilistic aspects of queries into a single formalism. The purpose of this paper is twofold: first to show that when formulated at an abstract level, problems from IR, from statistical data analysis, and from physical measurement theories are very similar and hence can profitably be cross-fertilised, and, secondly, to propose a novel method of fuzzy hierarchical clustering, termed \textit{semantic distillation} -- strongly inspired from the theory of quantum measurement --, we developed to analyse raw data coming from various types of experiments on DNA arrays. We illustrate the method by analysing DNA arrays experiments and clustering the genes of the array according to their specificity.Comment: Accepted for publication in Studies in Computational Intelligence, Springer-Verla

    Soft Image Segmentation: On the Clustering of Irregular, Weighted, Multivariate Marked Networks

    Get PDF
    The contribution exposes and illustrates a general, flexible formalism, together with an associated iterative procedure, aimed at determining soft memberships of marked nodes in a weighted network. Gathering together spatial entities which are both spatially close and similar regarding their features is an issue relevant in image segmentation, spatial clustering, and data analysis in general. Unoriented weighted networks are specified by an ``exchange matrix", determining the probability to select a pair of neighbors. We present a family of membership-dependent free energies, whose local minimization specifies soft clusterings. The free energy additively combines a mutual information, as well as various energy terms, concave or convex in the memberships: within-group inertia, generalized cuts (extending weighted Ncut and modularity), and membership discontinuities (generalizing Dirichlet forms). The framework is closely related to discrete Markov models, random walks, label propagation and spatial autocorrelation (Moran's I), and can express the Mumford-Shah approach. Four small datasets illustrate the theory

    Learning Shape Segmentation Using Constrained Spectral Clustering and Probabilistic Label Transfer

    Get PDF
    International audienceWe propose a spectral learning approach to shape segmentation. The method is composed of a constrained spectral clustering algorithm that is used to supervise the segmentation of a shape from a training data set, followed by a probabilistic label transfer algorithm that is used to match two shapes and to transfer cluster labels from a training-shape to a test-shape. The novelty resides both in the use of the Laplacian embedding to propagate must-link and cannot-link constraints, and in the segmentation algorithm which is based on a learn, align, transfer, and classify paradigm. We compare the results obtained with our method with other constrained spectral clustering methods and we assess its performance based on ground-truth data

    Scuba:Scalable kernel-based gene prioritization

    Get PDF
    Abstract Background The uncovering of genes linked to human diseases is a pressing challenge in molecular biology and precision medicine. This task is often hindered by the large number of candidate genes and by the heterogeneity of the available information. Computational methods for the prioritization of candidate genes can help to cope with these problems. In particular, kernel-based methods are a powerful resource for the integration of heterogeneous biological knowledge, however, their practical implementation is often precluded by their limited scalability. Results We propose Scuba, a scalable kernel-based method for gene prioritization. It implements a novel multiple kernel learning approach, based on a semi-supervised perspective and on the optimization of the margin distribution. Scuba is optimized to cope with strongly unbalanced settings where known disease genes are few and large scale predictions are required. Importantly, it is able to efficiently deal both with a large amount of candidate genes and with an arbitrary number of data sources. As a direct consequence of scalability, Scuba integrates also a new efficient strategy to select optimal kernel parameters for each data source. We performed cross-validation experiments and simulated a realistic usage setting, showing that Scuba outperforms a wide range of state-of-the-art methods. Conclusions Scuba achieves state-of-the-art performance and has enhanced scalability compared to existing kernel-based approaches for genomic data. This method can be useful to prioritize candidate genes, particularly when their number is large or when input data is highly heterogeneous. The code is freely available at https://github.com/gzampieri/Scuba
    corecore